Introduction:

In this section of our analysis, we take an unsupervised machine learning approach and look for natural clusters (segments) within our world happiness data. Because we have already seen some initial correlations between economic indicators and happiness, we expect to find some stable clusters.

We will use the k-means clustering algorithm, alongside several alternative methods, and perform a bootstrap validation to check cluster stability.
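As an illustration of what a bootstrap stability check can look like, here is a minimal base-R sketch that scores each k-means cluster by Jaccard similarity across resamples. It is not the validation code used in this report: `USArrests` is a stand-in for `cluster_df`, and `k`, `B`, and the helper `jaccard()` are hypothetical choices made for the example.

```r
# Sketch: bootstrap stability of k-means clusters via Jaccard similarity.
# USArrests is a stand-in dataset; the report's cluster_df would replace it.
set.seed(1)
x <- scale(USArrests)
k <- 2     # hypothetical number of clusters
B <- 50    # hypothetical number of bootstrap resamples

base_fit <- kmeans(x, centers = k, nstart = 25)

# Jaccard similarity between two sets of row indices
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# stability[b, j]: best Jaccard match for original cluster j in resample b
stability <- matrix(NA_real_, nrow = B, ncol = k)
for (b in seq_len(B)) {
  rows <- sample(nrow(x), replace = TRUE)   # resample rows with replacement
  boot_fit <- kmeans(x[rows, , drop = FALSE], centers = k, nstart = 25)
  for (j in seq_len(k)) {
    # Members of original cluster j that were drawn in this resample
    orig_members <- intersect(which(base_fit$cluster == j), rows)
    if (length(orig_members) == 0) next     # cluster j absent from resample
    sims <- vapply(seq_len(k), function(l) {
      jaccard(orig_members, rows[boot_fit$cluster == l])
    }, numeric(1))
    stability[b, j] <- max(sims)
  }
}

# Mean Jaccard similarity per original cluster
mean_jaccard <- colMeans(stability, na.rm = TRUE)
print(round(mean_jaccard, 2))
```

Values close to 1 mean a cluster is recovered almost unchanged in most resamples, while low values indicate a cluster that dissolves when the data are perturbed.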

Distance Measures:

HeatMap:

Assessing Clustering Tendency:

## $hopkins_stat
## [1] 0.3154874
## 
## $plot

With a Hopkins statistic of about 0.32, well below 0.5, and reading it under the convention in which values near 1 indicate strong clustering structure, the data do not appear to be inherently clusterable. We can still produce clusters, but the boundaries between them are likely to be softly defined. This is expected given the variety of indices we are measuring and the inherent heterogeneity of the world's countries. We expect clustering to perform better when applied to the output of PCA / Factor Analysis than to our relatively raw dataset.
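To make the statistic concrete, the sketch below computes a Hopkins statistic from scratch in base R under the "near 1 means clustered" convention. It is an illustrative implementation, not the routine that produced the value above; the function name `hopkins_sketch`, the sample size `m`, the choice of raising distances to the power `d` (the number of columns), and the simulated datasets are all assumptions made for the example.

```r
# Minimal Hopkins statistic sketch (base R only).
# Convention: H near 1 -> clustered, H near 0.5 -> spatially random.
hopkins_sketch <- function(x, m = floor(nrow(x) / 10), seed = 1) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  set.seed(seed)

  # m real observations, sampled without replacement
  idx <- sample(n, m)

  # m artificial points drawn uniformly from the bounding box of the data
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  unif <- sapply(seq_len(d), function(j) runif(m, mins[j], maxs[j]))
  unif <- matrix(unif, nrow = m)   # keep matrix shape even when m == 1

  # Euclidean distance from point p to its nearest row of pts
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))

  # u: artificial point -> nearest real point
  u <- vapply(seq_len(m), function(i) nearest(unif[i, ], x), numeric(1))
  # w: sampled real point -> nearest other real point
  w <- vapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]),
              numeric(1))

  sum(u^d) / (sum(u^d) + sum(w^d))
}

# Simulated stand-in data: two tight blobs versus uniform noise
set.seed(42)
blobs <- rbind(matrix(rnorm(200, mean = 0,  sd = 0.5), ncol = 2),
               matrix(rnorm(200, mean = 10, sd = 0.5), ncol = 2))
noise <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(blobs)
h_uniform   <- hopkins_sketch(noise)
print(c(clustered = h_clustered, uniform = h_uniform))
```

Some implementations report the complement, so that values near 0 indicate clustered data; it is worth checking which convention a given function uses before interpreting its output.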

Standard K-Means:

Determining Optimal Number of Clusters with K-Means Approach:

K-Means Clustering:

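library("factoextra")  # provides fviz_cluster(); also attaches ggplot2

# Fix the RNG seed so the random starts are reproducible
set.seed(24286)
# k-means with 7 clusters, keeping the best of 25 random starts
km.res <- kmeans(cluster_df, centers = 7, nstart = 25)

# Visualize the clusters with convex hulls around each group
fviz_cluster(km.res, data = cluster_df,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())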

PAM Clustering:

Determining Optimal Number of Clusters with PAM Approach:

PAM Clustering:

Hierarchical Clustering:

## Warning in if (color == "cluster") color <- "default": the condition has
## length > 1 and only the first element will be used

Hybrid Hierarchical and K-Means Approach:

Fuzzy Clustering Approach:

## Warning in fanny(cluster_df, 2): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

##   cluster size ave.sil.width
## 1       1   70          0.30
## 2       2   62          0.29
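The warning above suggests lowering `memb.exp`, the membership exponent of `fanny()` (default 2). The sketch below compares the default with a smaller value; it uses `USArrests` as a stand-in for `cluster_df`, and the value 1.5 is a hypothetical choice for illustration, not a tuned setting for this report.

```r
library(cluster)   # fanny(); a recommended package shipped with standard R

x <- scale(USArrests)   # stand-in for the report's cluster_df

# Default membership exponent (2) versus a smaller, hypothetical value
fz_default <- fanny(x, k = 2)
fz_crisper <- fanny(x, k = 2, memb.exp = 1.5)

# Normalized Dunn partition coefficient: 0 = completely fuzzy, 1 = crisp
coef_default <- unname(fz_default$coeff["normalized"])
coef_crisper <- unname(fz_crisper$coeff["normalized"])
print(c(default = coef_default, crisper = coef_crisper))

# Average silhouette width per cluster for the nearest crisp assignment
sil_widths <- fz_crisper$silinfo$clus.avg.widths
print(round(sil_widths, 2))
```

A smaller exponent generally pushes memberships away from 1/k, but it does not create structure that is absent from the data, so the silhouette widths remain the more informative check.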

HCPC (Clustering on PCA Output):

Final Evaluation:

Internal Cluster Validation:

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                   2        3        4        5        6        7        8        9       10
##                                                                                                            
## hierarchical Connectivity   11.5111  17.6421  24.7548  29.1127  36.5270  39.9643  46.9325  47.9325  56.2524
##              Dunn            0.2787   0.2848   0.2911   0.2911   0.3372   0.3372   0.3966   0.3966   0.4276
##              Silhouette      0.3203   0.2573   0.2538   0.2286   0.2465   0.2374   0.2386   0.2251   0.2021
## kmeans       Connectivity   23.4702  43.3377  46.2397  41.0754  58.3714  60.6087  69.1726  81.1623  78.3413
##              Dunn            0.2194   0.1660   0.1997   0.3156   0.2698   0.2730   0.2619   0.3054   0.3624
##              Silhouette      0.3203   0.2445   0.2565   0.2514   0.2471   0.2405   0.2120   0.1930   0.1859
## pam          Connectivity   34.2468  45.0520  42.3103  47.1429  78.7536  88.0032 101.4996  98.8163 113.6448
##              Dunn            0.2194   0.1980   0.2649   0.3089   0.3068   0.3068   0.3068   0.3138   0.3138
##              Silhouette      0.3141   0.2298   0.2322   0.2418   0.1996   0.1915   0.1700   0.1730   0.1472
## 
## Optimal Scores:
## 
##              Score   Method       Clusters
## Connectivity 11.5111 hierarchical 2       
## Dunn          0.4276 hierarchical 10      
## Silhouette    0.3203 hierarchical 2
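A summary of this form can be produced with the `clValid` package. The call below is a sketch of how such a comparison might be set up rather than the exact code behind the table; it assumes `clValid` is installed and uses `USArrests` as a stand-in for `cluster_df`. When reading the scores, connectivity should be minimized, while the Dunn index and silhouette width should be maximized.

```r
library(clValid)   # assumed installed; provides clValid() and optimalScores()

x <- scale(USArrests)   # stand-in for the report's cluster_df (needs row names)

# Compare three algorithms over 2 to 10 clusters using internal measures
internal <- clValid(x, nClust = 2:10,
                    clMethods = c("hierarchical", "kmeans", "pam"),
                    validation = "internal")

summary(internal)                 # full table plus optimal scores
best <- optimalScores(internal)   # best method and cluster count per measure
print(best)
```

Passing `validation = "stability"` instead would compute the package's stability measures (APN, AD, ADM, FOM).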

Other Cluster Stability Measures: